As we have seen earlier, genomic scores are often stored in wiggle format.
A perhaps more common human readable format is bedGraph.
Genomic Scores are heavily used in Genomics and High throughput sequencing as they offer a simple mechanism to review a defined metric over the linear genome at a specified resolution.
From our last session we identified Myc peaks within the Igfbp2 locus and in IGV compared Myc ChIP-seq signal from Encode over our peaks.
Two popular Bioconductor packages for dealing with Genomics Scores are:
Now we have the package installed, we can load the library rtracklayer which we will use to import and export from/to bedGraph and bigWig.
library(rtracklayer)We will also be making use of the functions in the GenomicRanges package. We dont need to load GenomicRanges directly here because the rtracklayer does this for us.
Package dependencies and imports allow one package to make use of functions from another.
The rtracklayer package provides functions to import genomic scores from a bedGraph using the import.bedGraph() function.
myBedG <- import.bedGraph("../../Data/TSS_ENCFF940MBK.bedGraph")Because we only have 4 columns in a bedGraph and no strand information, the GRanges intervals are unstranded with * in their strand column
strand(myBedG)## factor-Rle of length 2161 with 1 run
## Lengths: 2161
## Values : *
## Levels(3): + - *
The import bigWig’s genomic scores are again imported as a GRanges object containing the same information as the imported bedGraph.
myBigWig[1:3]## GRanges object with 3 ranges and 1 metadata column:
## seqnames ranges strand | score
## <Rle> <IRanges> <Rle> | <numeric>
## [1] chr1 [ 1, 72811054] * | 0
## [2] chr1 [72811055, 72811119] * | 0.690400004386902
## [3] chr1 [72811120, 72811145] * | 0.578019976615906
## -------
## seqinfo: 54 sequences from an unspecified genome
We can however specify the type of objects we would like to return from the import.bedGraph and import.bw functions.
Here we will import the bigWig as a object we have briefly seen, the Rle object (run length encoding). Here we have an Rlelist (a list of Rle objects)
myBigWig <- import.bw("../../Data/TSS_ENCFF940MBK.bw",
as = "RleList")
class(myBigWig)## [1] "SimpleRleList"
## attr(,"package")
## [1] "IRanges"
Run length encoding allows for a very efficient storage of long stretchs of repeated values.
We have already seen an rle in our cigar string from SAM files.
For genomic scores we will be storing long stretchs of numbers as an Rle.
To store genomic scores across chromosomes/contig we will use an RleList.
Creating an Rle is straightforward. We can simply supply a numeric vector of numbers we wish to compress to the Rle() function.
myNumbers <- c(0,0,0,0,0,1,1,1,0,0,0,0,0)
Rle(myNumbers)## numeric-Rle of length 13 with 3 runs
## Lengths: 5 3 5
## Values : 0 1 0
Rle allows for us to compress space needed to store genomic scores and so the Bioconductor Rle and RleList objects allows us make use of this in R.
We can see the compression of space in action by reviewing our imported genomic scores as an RleList objects.
myBigWig[1:2]## RleList of length 2
## $chr1
## numeric-Rle of length 195471971 with 2108 runs
## Lengths: 72811054 65 ... 122614997
## Values : 0 0.690400004386902 ... 0
##
## $chr10
## numeric-Rle of length 130694993 with 1 run
## Lengths: 130694993
## Values : 0
Now we have an Rle object of our genomic scores on chromosome 1 we can index it just like we have for standard vectors. Here i retrieve values for all basepairs between 1 and 10
chr1_rle[1:10]## numeric-Rle of length 10 with 1 run
## Lengths: 10
## Values : 0
We can convert Rle’s to standard vectors.
rleAsVector <- as.vector(chr1_rle[1:10])
rleAsVector## [1] 100 100 100 100 100 100 100 100 100 100
Many vector operations and function work with Rle objects.
These include
Logical operations can be used with Rle objects just as with vectors.
For Rle objects, logical opertations return a logical Rle instead of standard logical vector.
chr1_rle < 10## logical-Rle of length 195471971 with 20 runs
## Lengths: 10 72823081 55 ... 65 2 122627007
## Values : FALSE TRUE FALSE ... TRUE FALSE TRUE
Many functions providing summary statisitics are also available to us including mean(), max() and sum() functions.
mean(chr1_rle)## [1] 4.181372e-05
max(chr1_rle)## [1] 100
sum(chr1_rle)## [1] 8173.411
Summary functions return a summary for every Rle object (here for chromosomes) within the RleList.
chromosomeMax <- max(myBigWig)
chromosomeMax[1:4]## chr1 chr10 chr11 chr12
## 23.09015 0.00000 0.00000 0.00000
Now we can simply index our RleList object by providing our GRanges as an index in [] brackets.
rleOverGranges <- myBigWig[newMycPeaks]
rleOverGranges## RleList of length 2
## $chr1
## numeric-Rle of length 50 with 14 runs
## Lengths: 6 3 ... 1
## Values : 16.1809902191162 18.4152698516846 ... 21.8972606658936
##
## $chr1
## numeric-Rle of length 50 with 9 runs
## Lengths: 1 10 ... 2
## Values : 11.2515201568604 9.79598999023438 ... 7.09102010726929
We may wish to export our RleList or Rle objects back to a bigWig file.
We would need to convert any Rle objects to a RleList first in order to provide some chromosome/contig names. We can do this with the RleList() function and providing a chromosome/contig name to our Rle object.
myRleList <- RleList(chr1=chr1_rle)
myRleList## RleList of length 1
## $chr1
## numeric-Rle of length 195471971 with 122 runs
## Lengths: 10 72823081 ... 122627007
## Values : 100 0 ... 0
Now we have our RleList object we export this to a bigWig using the export.bw() function.
export.bw(myRleList,con="chr1_Myc.bw")When importing only a portion of a bigWig we simply need to specify a GRanges of regions we wish to retrieve to the BigWigSelection() function.
Here we will use the Myc Peaks in our window as the GRanges of regions for selection.
newMycPeaks## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## [1] chr1 [72824021, 72824070] * |
## [2] chr1 [72844876, 72844925] * |
## name score
## <character> <numeric>
## [1] Myc_Ch12_1_withInput_Input_Ch12_peak_246 18.55062
## [2] Myc_Ch12_1_withInput_Input_Ch12_peak_247 9.27569
## -------
## seqinfo: 21 sequences from an unspecified genome; no seqlengths